Data source
https://www.kaggle.com/datasets/aagambshah/lung-cancer-dataset

Data description
This dataset contains responses from individuals who participated in a survey to identify behavioral and demographic factors associated with lung cancer. The dataset can be used for exploratory data analysis, statistical modeling, and machine learning classification tasks to predict lung cancer risk.
Process

Importing libraries

```python
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score, roc_auc_score

import torch
import torch.nn as nn
import torch.optim as optim

import data_analysis_tools as dat  # plotting helper module used in the EDA section
```
Viewing the data

```python
df = pd.read_csv('survey lung cancer.csv')
df.head()
```
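The features have the following meanings:
GENDER: respondent's gender (male/female)
AGE: respondent's age
SMOKING: smoking habit (yes/no)
YELLOW_FINGERS: yellowed fingers (yes/no)
ANXIETY: anxiety present (yes/no)
PEER_PRESSURE: has experienced peer pressure (yes/no)
CHRONIC DISEASE: existing chronic disease (yes/no)
FATIGUE: fatigue (yes/no)
ALLERGY: allergy (yes/no)
WHEEZING: wheezing symptoms (yes/no)
ALCOHOL CONSUMING: alcohol consumption habit (yes/no)
COUGHING: frequent coughing (yes/no)
SHORTNESS OF BREATH: shortness of breath (yes/no)
SWALLOWING DIFFICULTY: difficulty swallowing (yes/no)
CHEST PAIN: chest pain (yes/no)
LUNG_CANCER: lung cancer diagnosis (yes/no)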
```
GENDER                   0
AGE                      0
SMOKING                  0
YELLOW_FINGERS           0
ANXIETY                  0
PEER_PRESSURE            0
CHRONIC DISEASE          0
FATIGUE                  0
ALLERGY                  0
WHEEZING                 0
ALCOHOL CONSUMING        0
COUGHING                 0
SHORTNESS OF BREATH      0
SWALLOWING DIFFICULTY    0
CHEST PAIN               0
LUNG_CANCER              0
dtype: int64
```
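From the information above:
The dataset has 15 features and 1 target, so this is a binary classification problem.
The dataset has no missing values.
The yes/no features are stored as 1 and 2 (1 = no, 2 = yes).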
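These three observations can be checked directly in pandas. The snippet below is a minimal sketch on a tiny hand-made frame (`toy` and its values are hypothetical, not rows from the survey); on the real data the same calls would be applied to `df`.

```python
import pandas as pd

# Hypothetical stand-in for the survey data: same kind of columns, made-up values.
toy = pd.DataFrame({
    "GENDER": ["M", "F", "F", "M"],
    "AGE": [69, 74, 59, 63],
    "SMOKING": [1, 2, 1, 2],
    "LUNG_CANCER": ["YES", "YES", "NO", "NO"],
})

# Number of feature columns (everything except the target).
n_features = toy.shape[1] - 1

# Missing values per column; all zeros means nothing needs imputing.
missing_per_column = toy.isnull().sum()

# Distinct codes used by a yes/no column.
smoking_codes = sorted(toy["SMOKING"].unique().tolist())

print(n_features)
print(missing_per_column)
print(smoking_codes)
```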
Feature encoding
We re-encode the binary variables as 0 and 1.
['GENDER', 'AGE', 'SMOKING', 'YELLOW_FINGERS', 'ANXIETY', 'PEER_PRESSURE', 'CHRONIC DISEASE', 'FATIGUE ', 'ALLERGY ', 'WHEEZING', 'ALCOHOL CONSUMING', 'COUGHING', 'SHORTNESS OF BREATH', 'SWALLOWING DIFFICULTY', 'CHEST PAIN', 'LUNG_CANCER']
```python
# Encode gender: male -> 1, female -> 0
df['GENDER'] = df['GENDER'].map({'M': 1, 'F': 0})

# Yes/no columns are stored as 1 (no) / 2 (yes); remap them to 0 / 1.
# 'FATIGUE ' and 'ALLERGY ' carry a trailing space in the column names.
columns = [
    'SMOKING', 'YELLOW_FINGERS', 'ANXIETY', 'PEER_PRESSURE',
    'CHRONIC DISEASE', 'FATIGUE ', 'ALLERGY ', 'WHEEZING',
    'ALCOHOL CONSUMING', 'COUGHING', 'SHORTNESS OF BREATH',
    'SWALLOWING DIFFICULTY', 'CHEST PAIN',
]
for column in columns:
    df[column] = df[column].map({1: 0, 2: 1})

# Encode the target: YES -> 1, NO -> 0
df['LUNG_CANCER'] = df['LUNG_CANCER'].map({'YES': 1, 'NO': 0})
```
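`Series.map` with a dict turns any value that is missing from the dict into `NaN`, so an unexpected code in the data would pass silently. A hedged sketch of a more defensive variant follows: it strips the stray whitespace from the column names first and checks that the mapping left no gaps. The frame `toy` and the `mappings` dict are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical rows; note the trailing space in 'FATIGUE ', as in the column list.
toy = pd.DataFrame({
    "GENDER": ["M", "F", "M"],
    "FATIGUE ": [2, 1, 2],
    "LUNG_CANCER": ["YES", "NO", "YES"],
})

# Normalise column names so later code does not depend on invisible spaces.
toy.columns = toy.columns.str.strip()

mappings = {
    "GENDER": {"M": 1, "F": 0},
    "FATIGUE": {1: 0, 2: 1},
    "LUNG_CANCER": {"YES": 1, "NO": 0},
}
for column, mapping in mappings.items():
    toy[column] = toy[column].map(mapping)

# Any value absent from a mapping would have become NaN.
unmapped = int(toy.isnull().sum().sum())
assert unmapped == 0, "some values were not covered by the mappings"
print(toy)
```

Note that stripping the names changes `'FATIGUE '` to `'FATIGUE'`, so any later code that indexes by the original names would need the same change.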
EDA

```python
dat.plot_all_barplots(df, hue='LUNG_CANCER')
```
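In the bar plots, respondents with lung cancer show higher levels on every feature than respondents without lung cancer.
From the correlation matrix above, GENDER and ALCOHOL CONSUMING have a relatively high correlation coefficient, so we look at how the two are related.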
```python
# Mean alcohol consumption per gender, split by lung cancer status
sns.barplot(data=df, x='GENDER', y='ALCOHOL CONSUMING', hue='LUNG_CANCER')
plt.show()
```
Among men, those who do not drink have a lower probability of lung cancer than those who drink. Among women
```python
# Correlation of every feature with the target, highest first
feature_importance = df.corr()['LUNG_CANCER'].sort_values(ascending=False)
# Drop the target itself (its self-correlation is always 1)
feature_importance = feature_importance.drop('LUNG_CANCER')

plt.figure(figsize=(10, 6))
sns.barplot(x=feature_importance.values, y=feature_importance.index)
plt.title('Feature Importance')
plt.xlabel('Correlation with Lung Cancer')
plt.ylabel('Features')
plt.show()
```
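`df.corr()['LUNG_CANCER']` is a signed Pearson correlation with the target rather than a model-based importance, so a strongly negative feature sorts to the bottom of the chart even though it is informative. The following is a minimal sketch, on made-up data, of ranking by absolute correlation instead; `toy`, its columns, and `ranked` are illustrative names, not part of the original analysis.

```python
import pandas as pd

# Hypothetical encoded data: A equals the target, B mostly opposes it, C is weakly related.
toy = pd.DataFrame({
    "A": [0, 0, 1, 1, 1, 0],
    "B": [1, 1, 0, 0, 0, 0],
    "C": [0, 1, 0, 1, 0, 1],
    "LUNG_CANCER": [0, 0, 1, 1, 1, 0],
})

# Signed correlation of each feature with the target.
corr_with_target = toy.corr()["LUNG_CANCER"].drop("LUNG_CANCER")

# Reorder by strength of association, keeping the sign in the values.
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)
print(ranked)
```

Sorting the signed values would place B last; sorting by magnitude moves it ahead of C.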
Data splitting

```python
# Separate features and target, then hold out 20% of the rows for testing
X = df.drop(['LUNG_CANCER'], axis=1)
y = df['LUNG_CANCER']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
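The classes in this dataset are heavily imbalanced (the test reports in this post show 2 negatives against 60 positives), so a plain random split can leave very few negatives in the test set. One option, not used in the original post, is a stratified split through the `stratify` parameter of `train_test_split`. The arrays below are synthetic and the class counts are made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 40 negatives, 260 positives (made-up counts).
y_toy = np.array([0] * 40 + [1] * 260)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratification both parts keep roughly the same share of negatives.
train_neg_share = float((y_tr == 0).mean())
test_neg_share = float((y_te == 0).mean())
print(train_neg_share, test_neg_share)
```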
Standardization

```python
# Fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```
Model building

Random forest

```python
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)

report = classification_report(y_test, y_pred)
print(report)
```
```
              precision    recall  f1-score   support

           0       0.50      0.50      0.50         2
           1       0.98      0.98      0.98        60

    accuracy                           0.97        62
   macro avg       0.74      0.74      0.74        62
weighted avg       0.97      0.97      0.97        62
```
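With only 2 negative samples in the test set, the 0.97 accuracy is dominated by the positive class. The labels below are inferred from the report (precision and recall of 0.50 on 2 negatives, 0.98 on 60 positives) and are not the author's actual predictions; under that assumption, the confusion matrix and balanced accuracy look like this:

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, confusion_matrix

# Reconstructed labels (an assumption consistent with the report above):
# 2 samples of class 0 and 60 of class 1, with one error in each direction.
y_true = np.array([0, 0] + [1] * 60)
y_hat = np.array([0, 1] + [1] * 59 + [0])

cm = confusion_matrix(y_true, y_hat)  # rows: true class, columns: predicted class
accuracy = float((y_true == y_hat).mean())
balanced = float(balanced_accuracy_score(y_true, y_hat))  # mean of per-class recall

print(cm)
print(round(accuracy, 4), round(balanced, 4))
```

Balanced accuracy averages the recall of each class, so it lands near the 0.74 macro average rather than the 0.97 headline figure.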
Neural network

```python
# Rebuild features and labels as float32 arrays for PyTorch
X = df.drop(columns=['LUNG_CANCER']).values.astype(np.float32)
y = df['LUNG_CANCER'].values.astype(np.float32)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize with statistics from the training set only
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train).astype(np.float32)
X_test = scaler.transform(X_test).astype(np.float32)

# Convert to tensors; labels get shape (N, 1) to match the model output
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32).unsqueeze(1)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.float32).unsqueeze(1)


class BinaryClassifier(nn.Module):
    """Small fully connected network: input -> 16 -> 8 -> 1 with a sigmoid output."""

    def __init__(self, input_dim):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, 16)
        self.fc2 = nn.Linear(16, 8)
        self.fc3 = nn.Linear(8, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = torch.sigmoid(self.fc3(x))  # probability of class 1
        return x


input_dim = X_train.shape[1]
model = BinaryClassifier(input_dim)

criterion = nn.BCELoss()  # expects probabilities, hence the sigmoid above
optimizer = optim.Adam(model.parameters(), lr=0.001)

epochs = 50
batch_size = 32

for epoch in range(epochs):
    model.train()
    running_loss = 0.0

    # Iterate over the training set in fixed-order mini-batches (no shuffling)
    for i in range(0, len(X_train_tensor), batch_size):
        inputs = X_train_tensor[i:i+batch_size]
        labels = y_train_tensor[i:i+batch_size]

        outputs = model(inputs)
        loss = criterion(outputs, labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        running_loss += loss.item()  # sum of per-batch losses, not an average

    print(f"Epoch [{epoch+1}/{epochs}], Loss: {running_loss:.4f}")
```
```
Epoch [1/50], Loss: 5.3979
Epoch [2/50], Loss: 5.2241
Epoch [3/50], Loss: 5.0661
Epoch [4/50], Loss: 4.9229
Epoch [5/50], Loss: 4.7911
Epoch [6/50], Loss: 4.6619
Epoch [7/50], Loss: 4.5310
Epoch [8/50], Loss: 4.3962
Epoch [9/50], Loss: 4.2550
Epoch [10/50], Loss: 4.1069
Epoch [11/50], Loss: 3.9532
Epoch [12/50], Loss: 3.7950
Epoch [13/50], Loss: 3.6341
Epoch [14/50], Loss: 3.4730
Epoch [15/50], Loss: 3.3150
Epoch [16/50], Loss: 3.1634
Epoch [17/50], Loss: 3.0199
Epoch [18/50], Loss: 2.8862
Epoch [19/50], Loss: 2.7630
Epoch [20/50], Loss: 2.6512
Epoch [21/50], Loss: 2.5504
Epoch [22/50], Loss: 2.4604
Epoch [23/50], Loss: 2.3804
Epoch [24/50], Loss: 2.3079
Epoch [25/50], Loss: 2.2413
…
Epoch [47/50], Loss: 1.3242
Epoch [48/50], Loss: 1.3006
Epoch [49/50], Loss: 1.2785
Epoch [50/50], Loss: 1.2578
```
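The logged loss is the sum of the per-batch losses in an epoch, so it is easier to interpret after dividing by the number of batches. The arithmetic below is a sketch that infers the batch count from figures in this post (62 test rows, `test_size=0.2`, `batch_size=32`); the dataset size itself is not stated here, so it is bounded rather than assumed.

```python
import math

test_support = 62          # test-set size shown in the classification reports
batch_size = 32            # from the training loop
first_epoch_loss = 5.3979  # logged sum for epoch 1
last_epoch_loss = 1.2578   # logged sum for epoch 50

# train_test_split rounds the test size up, so 62 test rows at test_size=0.2
# means the full dataset has ceil(n / 5) == 62 rows for some n.
candidate_totals = [n for n in range(300, 320) if -(-n // 5) == test_support]

# Mini-batches per epoch for each candidate size; they all agree.
batch_counts = sorted({math.ceil((n - test_support) / batch_size) for n in candidate_totals})
n_batches = batch_counts[0]

mean_first = first_epoch_loss / n_batches
mean_last = last_epoch_loss / n_batches

print(candidate_totals, batch_counts)
print(round(mean_first, 4), round(mean_last, 4))
print(round(math.log(2), 4))  # BCE of a model that always predicts 0.5
```

The first-epoch average per batch sits just below ln 2, which is what an untrained binary classifier would be expected to produce.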
```python
model.eval()
with torch.no_grad():
    # Predicted probability of class 1 for each test sample
    y_pred_prob = model(X_test_tensor).numpy().flatten()

# Threshold the probabilities at 0.5 to get class labels
y_pred = (y_pred_prob > 0.5).astype(int)

accuracy = accuracy_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_pred_prob)

print("Classification report:")
print(classification_report(y_test, y_pred))
print(f"Test accuracy: {accuracy:.4f}")
print(f"Test AUC-ROC: {auc:.4f}")
```
```
Classification report:
              precision    recall  f1-score   support

         0.0       0.50      0.50      0.50         2
         1.0       0.98      0.98      0.98        60

    accuracy                           0.97        62
   macro avg       0.74      0.74      0.74        62
weighted avg       0.97      0.97      0.97        62

Test accuracy: 0.9677
Test AUC-ROC: 0.9500
```
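AUC-ROC can be read as the fraction of (positive, negative) pairs in which the positive sample receives the higher score. The sketch below implements that definition in plain Python; `pairwise_auc` and the demo scores are illustrative and are not taken from the model above.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up scores, only to exercise the function.
demo_auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(demo_auc)  # 0.75: three of the four pairs are ordered correctly

# Granularity on a test set with 2 negatives and 60 positives.
n_pairs = 2 * 60
print(n_pairs, round(0.95 * n_pairs))
```

With only 120 pairs, each misordered pair moves the AUC by about 0.008, so the reported 0.95 corresponds to roughly 114 correctly ordered pairs (assuming no tied scores) and should be read as a coarse estimate.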